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RESOLUTION AND SCALE INDEPENDENT FUNCTION 
MATCHING USING A STRING ENERGY PENALIZED 

SPLINE PRIOR 

By David M. Rogers*'*'* and Thomas L. Beck*'*> § 
University of Cincinnati 

The extension of the Bayesian penalized spline method to infer- 
ence on vector- valued functions is considered, with an emphasis on 
characterizing the suitability of the method for general application. 
We show that the standard quadratic penalty is exactly analogous to 
the energy of a stretched string, with the penalty parameter corre- 
sponding to its tension. This physical analogy motivates a discussion 
of resolution independence, which we define as the convergence of our 
function estimate to the nonparametric solution as the spline width 
decreases. The multidimensional context makes direct application of 
standard procedures for choosing the penalty parameter difficult, and 
a new method is proposed and extensively compared to the existing 
literature. An important class of problems which can be analyzed by 
this method are stochastic numerical integrators, which are consid- 
ered as an example problem. This work represents the first extension 
of penalized spline methods to inference on multidimensional numer- 
ical integrators reported in the literature. Several numerical calcu- 
lations illustrate the above points and address practical application 



1. Introduction. 

1.1. Motivation and Problem Background. Function estimation from ob- 
servational data is an important modeling tool because it allows us to explore 
the relationship between the independent variable and its response variable 
as well as predict observations yet to be made. Some simple example ap- 
plications involve estimating continuous changes in a property with respect 
to time (e.g. helmet acceleration during impact [32] or dose-response curves 
[15]). More advanced applications can involve more than one dimensional in- 
put, for example estimating scalar functions of space and/or time variables 
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[18, 6]. Here the function is modeled as a linear combination of functions 
with one or two-dimensional arguments, giving it the classification of a gen- 
eralized additive model [12]. Yet another generalization is possible when the 
function is vector- valued [38], for example the force on a particle moving in 
three dimensions [9]. We find in this last case the framework for analyzing 
a large class of novel applications such as stochastic integrators. 

In order for a function estimation method to be robust with respect to 
the experimental setup, it must satisfy two major requirements. First, the 
estimator must be resolution independent in the sense that it becomes stable 
in the limit of arbitrarily high resolution (where an acceptable approxima- 
tion to true function is almost certainly in the parameter space). Standard 
least squares estimators (maximum likelihood estimation using Eq. (1) with 
A = and without log \u terms) [25] do not satisfy this criterion. Smoothing 
splines were shown to have this property [27] and even an asymptotic equiv- 
alence to convolution kernel-based smoothing in the early work of B. Silver- 
man [31, 32], which was carried on by K. Messer [21] and D. Nychka [24]. A 
subsequent improvement in the original formulation came through divorcing 
the measurement positions, {r 1 }, from the spline knots [8] which allows for 
more efficient calculations. The resolution independence of penalized splines 
allows us to overcome the difficult "bin-width" problem encountered in linear 
least-squares fitting, where too low or too high resolution cause unphysically 
smooth curves or overfitting. Second, the estimator must be scale indepen- 
dent in the sense that it predicts the correct function shape when both the 
function and its noise are subject to arbitrary scaling. The Akaike informa- 
tion criterion (AIC) method fails this test, while generalized cross-validation 
(GCV) is scale independent but fails to separately address estimation of the 
sample variance. 

The physical spline device already satisfies the infinite resolution limit 
sought above for a nonparametric method. It consisted of a section of rubber 
held in place with push-pins at several points along a desired smooth curve 
[37]. This device could be numerically implemented by simply minimizing 
the potential energy function for (a suitable computational description of) 
a string's position subject to the given constraint points. However, modern 
applications of function fitting are often based on noisy measurements so 
that we cannot assume the input data {y, r}^ to constitute absolute con- 
straints. To modify the above device for this situation, imagine placing a 
push-pin at the location of each input measurement and making its connec- 
tion to the spline by attaching a vertical spring. This is, in fact, the system 
we are led to by the Bayesian analysis of Sec. 2 - where the role of the spring 
constants are played by z, the inverse of the measurement variance. In addi- 
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tion, the specification of an energy function leads to a complete description 
of the probability density for our spline's position if Boltzmann statistics 
are assumed. This assumption equates the minimum energy solution to the 
popular maximum likelihood estimate of statistics. 

Unfortunately, the system's potential energy function now depends jointly 
on the spring constants and the spline tension, A. This causes the ratio be- 
tween the two (A/2) to decide the form of the final solution. Nevertheless, it 
is an important result of this report that when proper prior probabilities are 
chosen for the tension and spring constants, scale-independence is achieved. 

In order to motivate the choice of B-spline basis functions used to describe 
/(r), we note some of their properties here. In keeping with the nonpara- 
metric requirement we require that the basis set be able to approximate any 
sufficiently smooth function to good accuracy. Using B-spline basis functions 
to represent the function space of interest allows a choice of the function's 
continuity, via the spline order, as well as its resolution, via the number 
of free parameters or knots. These choices can be important in numerical 
applications of the fitted function. In addition, for all choices of the knot 
spacing and spline order, the best B-spline approximation can be shown to 
have the same approximation error as a polynomial fitting [26]. 

Several issues in using penalized splines have been addressed by previ- 
ous studies. In attempts to understand scale dependence, several proposals 
for choosing the penalty scale parameter have been considered [28, 18, 15], 
including introduction of non-uniform (in r) penalty and/or variance pa- 
rameters [10, 5, 30, 4]. Adaptation to cases with non-Gaussian sampling 
error have been addressed by consideration of inference schemes with al- 
ternative likelihood functions [3]. Notably, several authors have interpreted 
the complete fitting process as a Bayesian inference problem [18]. This leads 
to consideration of the penalty function as a prior distribution on function 
space, with the attendant penalty parameter as a (nuisance) hyperparame- 
ter. 

1.2. Material Covered. In view of the above considerations, it seems 
that each new application area of P-splines should reconsider the proper 
choice of prior probability for function space and likelihood function for the 
problem at hand. This report will attempt to address these issues in the 
Bayesian formulation of P-splines, so that the method can be used in new 
applications with confidence. 

First, section 2 presents a physically motivated re-derivation of the prior 
penalty by likening the function to a stretched string. The shortcomings of 
the standard, scale-invariant prior probability for the penalty parameter are 
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re-examined and found to be due to the possibility of a degenerate polyno- 
mial solution. We show that we can exclude this possibility by introducing 
a string zero-point energy to modify the standard Jeffrey's prior, while re- 
taining scale invariance for arbitrarily large measurement scalings. 

Next we derive important properties of the proposed prior distribution in 
section 3. Resolution independence is proved by finding that the approxi- 
mation error of the n th derivative of a B-spline solution is of order h r ~ n . A 
review of the link between spline and convolution-based smoothing shows 
the role of higher-order derivatives in the penalty function. The marginal 
distribution of the penalty parameter serves to further clarify the role of 
the zero-point energy and greatly simplify sensitivity analysis to this fixed 
numerical value. The insight gained in these two sections allows automatic 
identification of the scale parameter in new application areas. 

Section 4 compares results from Markov chain Monte-Carlo (MCMC) sim- 
ulations on several test applications to provide a numerical test of the scale 
independence sought. An implementation of the multi-dimensional method 
for matching the output of a stochastic integrator is also reviewed and tested. 
For the particular problem studied, it is found that use of the posterior av- 
erage estimator gives significantly better results than maximum likelihood 
estimation for small sample sizes. 

Finally, notes on B-splines and computational implementation of the in- 
tegrations necessary for calculating higher order derivatives in the penalty 
function are provided in the appendix. 

1.3. Problem Formulation. We consider the regression problem where 
M (possibly noisy) measured values, {y l , r l }i , of a property y = /(r)+noise 
are to be fitted to an arbitrary function f(r) = Ylk=i Bk(tk( r )) = D(r) • 9. 
We allow for the possibility that y and r are vector-valued by choosing the 
design matrix D(r) 6 M^xp- in order to infer the value of the parameters 
9 6 R p , we derive the penalty function 



(1) 



logP (eXz\{y l ,r l }^l) = const. 

N 



Bt ] \tk{r)f p{t k )dt k + E 




(f " 1) 



2; 



Pk-n 
> 2 



1) log A* 



The indices are necessarily introduced by allowing each dimension of y to 
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have its own measurement variance af = z i 1 , and f(r) to be composed of the 
sum of Nf B-spline functions B(t) of one-dimensional variables £fc(r) € Sj fc 

and parameters 9j. such that X^fc^i Pk = P- Each of these functions has its 
own roughness penalty of order n, whose scale parameter (or tension) is 
We can analyze this complicated form by considering the simple case where 
iV and Nf are both one, making t\(r) = r and 



logP [9\z\{y l ,r l }f I) = const. 

M 



(2) 



4- - 

^ 2 



+ 



1=1 



;f -i)iog. 



;^-i)io g A. 



^{y l -B{r l )f + V Q 
B^ n \r) 2 p(r)dr + E 
2. Derivation of the Method. 



2.1. Thermal Equilibrium of a Stretched String. The prior probability 
for the space of all possible functions has been widely taken to be a Gaussian 
distribution on the function's spline parameters. The penalty matrix (inverse 
of the variance-covariance) is either an n th order difference on successive 
spline parameters or derived from an integral over the squared n th order 
derivative of the function - corresponding to an n th order random walk of 
the control points or the function f(r), respectively [8, 18]. 

Here, an appeal to the latter, traditional, penalty function is laid out based 
on consideration of the spline fit as a smooth string. In the tradition of the 
classical spline device [3 7], it seems appropriate to consider the function's 
n — 1 th derivative f^ n ~ l \r) as such a stretched string with tension T(r). 
Once a suitable energy function for the string is available, the assumption 
of Boltzmann statistics specifies a probability distribution for the string's 
position (and hence its parameters). 

Under Dirichlet boundary conditions, the standard expression for a string's 
potential energy is given by [using the notation from Eq. (2)]: 



(3) 



nf{r)] = \J T(r)||vW/( 



p(r)dr 



where p(r)dr is the volume element for integration (i.e. dr for a line, Airr 2 dr 
for a spherically symmetric membrane, etc.). For uniform tension To (or 
tension with unknown scale but known variation with respect to r) and a 
B-spline basis for f(r), this formula can be integrated to give the familiar 
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P-spline penalty function: 



(4) 



E[f(r)]/V = ^9 T .Q.e 



where we have made the definitions: 



(5) 



V W/(r) 



A(r) • 6 



(6) 



Q 



y j^A(r) T .A(r)p(r)dr 



(7) 



V 



p(r)dr 



The relation for the derivatives of f(r), Eq. (5), follows from the linearity 
of B-splined functions in their parameters, 0. As we allow a separate Q for 
each additive function of one variable, A is simply a row-vector. In this 
report, we distinguish matrix- multiplication from scalar multiplication with 
the symbol 

Likening the penalty to the spline energy gives the following connection 
to statistical mechanics. A system allowed to exchange heat with its sur- 
roundings until thermal equilibrium is reached has a Boltzmann probability 
distribution: 



Where the inverse thermal energy (3 determines the scale of the energy fluc- 
tuations (and thus the spline roughness) allowed in the system, whose posi- 
tion is specified by coordinates x. Defining the product PVTq as A recovers 
the traditional penalty function and assigns the meaning of "dimensionless 
tension" to the penalty parameter, A. 

Applying Boltzmann statistics to our string system thus gives us an im- 
proper normal distribution, since the n string modes corresponding to coef- 
ficients of an n — 1 degree polynomial have been assigned a uniform prior 
due to their absence from the energy function. 



It is interesting to note that the above formula starts from a string energy 
density which is dependent on the shape of the function, not the number of 
spline parameters, p. However, the prior becomes dependent on p through 
the rank of Q when normalized by integration over 8. From a mechanical 
perspective, this occurs simply because larger numbers of parameters allow 



(8) 




(9) 



P(0|AJ) oc \(P- n )/ 2 e ~2 eT - c >- 
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the function much more free space to move. This makes the probability of any 
individual choice of 9 correspondingly less. From a Bayesian perspective, this 
can be justified by considering how the number of functions within a small 
region of function space grows with p. We have thus recovered the common 
element of all previous formulations of P-splines via invoking Boltzmann 
statistics on an intuitive physical model system. 

2.2. Penalty and Variance Prior Distributions. As noted in the intro- 
duction, however, the penalty parameter and sample variance remain to be 
dealt with. The standard Bayesian choice is a simple scale-invariant prior for 
A, giving Eq. 10 with E = as the complete prior. Although this approach 
works most of the time, it has one serious numerical issue which appears 
in the limit as A approaches infinity. This situation occurs when the sam- 
ple size is small and the derivative order is large, allowing the belief that 
an n — 1 degree polynomial is a possible solution to the inference problem 
(i.e. a singularity at A = oo). Because MCMC sampling alternates between 
drawing values for 9 and A, it can get stuck once the algorithm encounters 
a A large enough to force a polynomial solution. 

Because of the divergence at large A, this parameter requires an alter- 
native prior distribution as has been done by several authors. Indeed, the 
earlier non-Bayesian suggestions for penalty parameter selection implicitly 
suggest a prior on A. Our analogy to stretched strings provides an alterna- 
tive method for deriving such a prior. We gain a vital clue by the above 
noted degeneration of the sampling process into a least-squares polynomial 
fit. Since our problem formulation in terms of stochastic splines implies an 
aversion to such simple solutions, we need to find a way to explicitly add this 
aversion as prior information. Adding v observations of small displacements, 
= wf to our state of knowledge gives a Gamma distribution 
for A| [Eq = J2i w i) with parameters oq = u/2, bo = Eq/2. This approach 
was tried by Lang and Brezger [18] with v = 2 and found to give results 
dependent on the scale of the function measurements due to bias toward 
a o/fro ~ which is the prediction for A implied by the displacements assumed 
in the new prior. This prediction for A is effectively specifying an energy 
scale for measuring displacements in our string. Jullion and Lambert [15] 
propose to set v = 2 and provide a hyperprior for Eq/v. The corresponding 
interpretation is that v prior observations of the string displacement were 
made but subject to large uncertainty. 

However, these approaches assume some prior knowledge about both ends 
of the energy scale, where the original intent of the additional information 
was simply to eliminate numerical instability at the high tension (degenerate 
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solution) end. The usual difficulty with scale-invariant priors is divergence 
toward zero due to lack of observations of the corresponding process' scale. 
Here, observations of the tension parameter are already made through the 
movements of the function away from a polynomial form. Divergence toward 
infinity as noted above should only occur if the input points actually lie on 
a polynomial and no noise occurs in the system. In all other cases, such 
divergence is an unwanted numerical artifact. In order to deal with this 
artifact, suppose that we arbitrarily set a value for 9 T • Q • 9 at which we 
consider the spline to be a degenerate polynomial solution. Since we are 
not interested in variations of the function below this threshold, we propose 
modifying the standard scale-invariant prior to impose this condition on the 
sampling process by adding Eq to 9 T • Q • 9. As noted in the discussion above, 
this is equivalent to the limit of specifying a string containing a zero-point 
energy, Eq, without any explicit observations (i.e. v = 0). This argument 
shows that Eq is a measure of residual uncertainty that the solution is an 
n — 1 degree polynomial. 



P (0A|I) = P (0|A/)P (A|J) 

^ K A (p-n)/2-l e -!(0 T -Q-0+£ o ) 

Setting v = has important consequences for previously reported prob- 
lems with this prior [15], since the implied prior distribution for A now 
approximates the scale independent prior, in particular its cumulants ap- 
proach zero independent from Eq. In fact, the new prior distribution on In A 
is sigmoid-like, going from 1 at — oo to 1/2 at lnAi/2 — ln(21n2) — lnEo 
(with slope — (ln2)/2) to at oo. 

2.3. Bayesian Posterior Parameter Distribution. Assuming independent 
Gaussian likelihood functions with variance a 2 = 1/z, the posterior distri- 
bution is given by 

P {9Z\\DI) OC X (P-n)/2-l z M/2-l 

x exp {-§ (\\Y - D ■ 6\\ 2 + Vb) - ±{9 T ■ Q • + Eq) } . 
And the conditional posterior distributions are therefore 

(12) e\ — N(e,n) 

S 1 = AQ + zD T • D, 5T 1 -9 = zB T -Y 

(13) z\ r (M/2, (||D • 9 - Yf + V )/2) 

(14) A| • • • ~ T ((p - n)/2, (9 T ■ Q • 9 + E )/2) , 
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where we have formed a matrix D € Mj^xp by stacking row- vectors of spline 
coefficients B(r) from all observations. Similarly, Y is a column- vector of all 
the observed function values. 

An additional parameter appears because the same type of situation de- 
scribed for A can also occur for the inverse variance parameter when the 
number of measured data points is low compared to the spline resolution. In 
this case, the spline can exactly match the input measurements resulting in 
a singularity as z — > oo. This case has been studied in detail by Wahba and 
Wang [36]. Our remedy will be just as above, adding a minimum variance 
Vq into the residual sum of squares. 

Using MCMC techniques [29, 11] to sample the posterior distribution 
Eq. (12-14) is the direct analogue of observing the multiple positions of a 
spline device (as described in the introduction) in thermal equilibrium. Al- 
ternately, a maximum likelihood estimate can be carried out using relatively 
fewer iterations of the above steps by simply maximizing each conditional 
posterior and iterating until convergence. Both processes require a Cholesky 
decomposition for 9 at each iteration, which is an 0(p 3 ) operation. 

We can calculate the classical mean-squared error (MSE) of the function 
estimate /(r) using Eq. 12 to compare with standard results [23]. This pro- 
cedure gives an indication of the convergence toward a known 9q. This is 
in contrast to the Bayesian a posteriori error estimate for 6, which is given 
by the covariance matrix (z~D T ■ D + AQ)" 1 = K/(zM). The squared er- 



ror is expressed as M~ 



Y -Y 



i I (f( r ) ~ f( r )) 2 dr and its expectation 
is taken with respect to all possible variations of the random noise. For 
the one-dimensional case, we therefore assume the vector of observations is 
Y = D • #o + e, and the corresponding estimator and error PDF are: 

(15) Y = D K D T (D /M + z _1/2 e/M) 

(16) Y - Y\9o{r}f Xzl ~ N (D • K • • 9 , -^(D • K • D T ) 2 ) 

Where we have used 9 = K • (D T • D/M + ^Q) • 9 and e is a vector of 

M standard normal random variables. The expected error is then 

(17) 

MSE = M -1 |m -2 ||D • K • • 6» || 2 + z^Tt [(M^D • K • D T ) 2 ] } . 

The second term on the right hand side is reminiscent of the Bayesian 
estimate, while the first term represents the bias introduced by assuming 
function smoothness. For a fixed X/z, the total dependence on M for this 
term (remembering Tr(D T ■ D) = M for periodic splines) is order M~ 2 . 
However, this is not a direct estimate of the bias of the full Bayesian inference 
scheme, which allows X/z to vary. 
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2.4. Multi- dimensional generalization. We first generalize Eq. (12) to 

the case where f(r) = Y2k=ifk( r ) * s the sum °f multiple additive scalar 
functions. We can do this by extending the basis for y(r) = D(r) • 9 by 
tacking on additional coefficients. For the case of Nf functions fk{ r ) = 
Bfc(r) • 8k with spline coefficients B/%(r) T G M. Pk , this means that D(r) = 
[Bi(r), B2(r), . . . , BjVj, (r)] - taking the total number of parameters to be 

p = Ylk=iPk- Each additive function should have its own string energy 
prior, making A a vector of dimension Nf and Q a set of matrices penalizing 
their respective functions. The posterior probability for each element of A 
replacing Eq. (14) is thus 



In addition, when Nf > 1, each function can only be determined to within 
an additive constant, and identifiability constraints must be placed on all 
but one of them. 

Next, we generalize to vector- valued functions for models of the form 
f(ri) = T)i ■ 9, where D/ is an N x p matrix. This can be constructed by 
summing matrices formed from the tensor product of any direction vector 
gk(r) 6 WL N with the familiar spline coefficients: 



Identifiability problems are worse for the multidimensional case. For def- 
initeness, assume without loss of generality that each function has a unique 
argument Bk(r) = Bk(tk(r)). If two functions Bki and Bk2 have the same ar- 
gument but different directions, they can combined to make a single unique 
function Bk{rk) = (gki{ r ) + 9k2{r)) £g> B(r^) • 9k- Cases with the same argu- 
ment but different ranges are irrelevant, since their ranges can be combined 
to make a single, larger spline function or considered as two independent 
variables for the present analysis. If, on the other hand, two functions with 
different arguments share the same directional vector for all samples en- 
countered, an identifiability problem occurs. We therefore make the further 
assumption that the sample size is sufficiently large so that if two directional 
vectors differ in some region in the space of r, that region is included in the 
sample. Allowing multiple functions to share the same set of spline param- 
eters collapses the total set of functions to identify via combining functions 
Bk which share parameters 9k to make a total of Np unique functions. Note 
the use of capitalized indices for combined sets. 



(18) 




(19) 



fk{r) = g k {r)®B k (r)-9 k 



(20) 




k&K 
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This leaves us with genuine identifiability constraints which can be seen by 
examining 



Wherein Ok denotes the constrained Ok (i-e. its average has been subtracted) 
and Ck are arbitrary additive constants. The effect of constraining Ok dur- 
ing sampling is to force Ck to zero. Any direction in which {C} could vary 
while leaving f(r) unchanged for every r must therefore have a correspond- 
ing constraint. The number of constraints is thus determined by the rank of 
R G MmxJV/ j formed by the column-vectors D^l (associating K with the 
column). Using the definition (20), K ttK = YlkeK 5 f fc(n)(B fc (r;) • 1) = g K {n), 
the assumptions above imply that the number of constraints is equal to the 
number of persistent linear dependencies among gK{f*)- 

It is worthwhile to notice that a multidimensional basis function can be 
constructed from differentiation of a scalar function of the familiar linear 

mixed model type (e.g. f(r) = Yl /fcfa))- When such derivative in- 
formation is used, more data points than a corresponding sample from the 
scalar function are acquired with each sample. In this case identifiability 
problems only occur when the function can be simplified by reducing the 
dimension of {tK( r )}i F in a linear algebraic way. 

Using a multivariate Normal likelihood function and assigning each di- 
mension of the observation its own independent variance again gives a set of 
linear equations to solve during sampling, replacing the mean and variance 
of Eq. (12) with 



(22) S- 1 = diag(A)-Q + J^D^-D^, E"^ = J^D^i 



. Where we have formed a list of matrices D £ M.$xNxp by stacking € 
M^Vxp for all observations and Q is understood to be an appropriate block- 
diagonal matrix of Q/t-s. Correlations between elements of each measurement 
can be incorporated by a trivial modification as long as they remain a known 
function of r. 

If Nj dimensions share a common zi, the posterior probability for each 
element of z replacing Eq. (13) is 



(21) 




K=l 



N 



i=l 



(23) 
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3. Properties of the Proposed String Energy Prior. Most of 
the properties of penalized splines can be analyzed analytically for a fixed 
ratio, a = j. This is due to the existence of unique solutions to the one- 
dimensional spline fitting problem for an unrestricted function in a Hilbert 
space [17, 27]. In addition, Silverman [31] showed how the solution process 
can be understood in terms of a convolution kernel at the large sample 
size limit, establishing an analogy between convolution smoothing methods 
and spline smoothing [21, 22, 24, 1]. These results allow us to investigate 
resolution independence and the effects of differing penalty function orders, 
n. To show how this result comes about and analyze its limitations, we will 
sketch a short and intuitive derivation here, while noting that more extensive 
research on this analogy has been carried out by others. 

Next, in order to expand these results for variable a, we derive the 
marginal likelihood of the variance and penalty parameters. This allows us 
to understand the properties of a Bayes' estimate for 6 in terms of averages 
over a as well as study the trade-off between under or over smoothing. 

3.1. Resolution Dependence of P- Spline Estimation. A general formula 
for the Fourier transform of the convolution kernel [Eq. (36)] correspond- 
ing to spline smoothing with arbitrary derivative order, n, was first given by 
Silverman [31]. Subsequently, several authors improved the results on asymp- 
totic convergence of smoothing spline and convolution-based methods, in- 
cluding Messer [21] and Messer and Goldstein [22], where convenient approx- 
imate convolution kernels were derived for fixed sample spacing. Nychka [24] 
relaxed the equal spacing restriction and showed that the convolution kernel 
decays exponentially as long as the samples are spaced closely enough. Fi- 
nally, Abramovich and Grinshtein [1] provided a systematic derivation of an 
asymptotically equivalent convolution kernel for arbitrary derivative order, 
sample spacing, and variable tension. We will briefly sketch the results of 
the latter here, while providing the connection to finite element solutions as 
proposed here in order to establish resolution-independence. 

We begin with the task of finding the function u £ H n (0) (where H n (r2) 
is the standard Sobolev space of functions with square integrable derivatives 
on Q up to n th order) which minimizes the functional 

AI r 

(24) $[ U ] = £ - u(t,)) 2 + fa / p(t)u^\tfdt. 

1=1 Jn 

In the following discussion we are considering both the infinite, O = R, and 
periodic domains, f2 = [0, V). Next, note that we can re-formulate the sum 
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appearing on the right hand side as an integral using the definitions: 

y(t) = I £{»:«=*,} !' ^ 111 

I 0, otherwise 

M 

(25) 4>{t ) = } 1 Y J 5{t-t l ) : 

i=i 

so that $ can be conveniently expressed as a norm using the scalar product 

< a,b >= J^a(t)b(t)dt, where a denotes complex conjugation. 

(26) 

= const. {{yi}i f )+ < Mv - u),<f>h(y -u)> +p% < p h u {n \ Ph vr> > 

To keep the notation simple, the definitions <fif L = (ft 1 / 2 and p; L = p 1 ! 2 are 
made. 

The solution can be found by setting the functional derivative to zero 
since the functional is positive and has a unique minimum as long as p^ 
is bounded above zero everywhere in £1 and cp(t) contains at least n mass 
points [16]. We proceed as in Kimeldorf and Wahba [17] by using the scalar 
product invariance of the Fourier-Plan cherel transform [34] g{oS) = Fg = 
(2tt)~ 1 / 2 J n e lujt g(t)dt, as well as the convolution theorem F[ab] = (27r)~ 1 / 2 a* 
b to get 

(27) ^ = ~ 2 ( 27r )~ 1/2 < F [M>$h *u> +(2?r)- 1 < 4> h *u,4> h *u> 

+(2vr)" 1 ^| 7 < p h * ((-iu) n u),p h * ((-iuj) n u) > +const. 

Functional differentiation with respect to u(u) yields the minimum as the 
solution of 

(28) F[<Mj](u) = F[<fm] + ^"F^F" V«]] 



VM 

(29) cj>y = ci>u+^{-ir^{pu^) 



The last equation gives us the differential operator form of the mini- 
mization problem (24) [16, 35, 2]. The interpretation for continuous sample 
density <fi(t) is immediate, however it also remains valid for finite M in 
the following sense. On any interval between sample points, (j){t) as defined 
above is formally zero, and hence the solution u(t) is that of the homoge- 
neous equation - i.e. a 2n — 1 order polynomial when p is constant or (by 
definition) an L-spline for general p [17]. When integrating equation (29) 
over small regions, we can treat all terms in the equation on equal footing 
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at the sample points. The net effect of the sample points is thus to effect 
changes in the function's n th (and higher) order derivatives. 

The above argument (and references) show that the unconstrained solu- 
tion of Eq. (24) is a unique element of W n (Q), and therefore (by definition) 
nonparametric. To show precisely what is meant by a resolution independent 
spline estimate, we must prove that the spline function can approximate the 
nonparametric solution to arbitrary accuracy as the number of spline pa- 
rameters increases. To do this, we draw an analogy between the B-spline 
solution 8\a of Eq. (12) and the Ritz-Galerkin finite element (weak) solu- 
tion of Eq. (29). This lets us apply the typical error bounds for B-spline 
approximation [26]. 

Multiplying (29) through by a test function v(t) and integrating gives the 
bilinear form 

(30) a(u,v)= [ 0uv + ^pu {n) v {n) dt 

Ju 

To use standard procedures to get the approximation error in the induced 
norm \\u\\ a = y a(u, u) we first define the norms 

(31) \v\ k ,m = 



dt k 



dt, 



with |f|fe )00 as the maximum of over f2. 

Now, let Uh = T)({t}f I ) ■ 8 be the solution of Eq. (12) and note that it 
is the minimizer of \\u — Uh\\ a over all in the space of r th order Cardinal 
B-splines with knot spacing h (denoted S r h ). We thus have, for any Vh 6 



u h \\l < \\u - v h \\l 



u - v h 



M . 

i=i Jn 



(32) \\u - u h \\ 2 a <\u- + - v h \l t2 . 

Where we have defined /? max = \ph(u — %)|o,oo- Choosing Vh as a projection 
of u onto E> r h following Reif [26] , we find 

(33) \\u - u h f a < (dh r ) 2 + ^(C 2 h r ~ n ) 2 , 

where C\ and C 2 are constants proportional to |u|„ )00 . 

The approximation error thus falls into two regimes depending on the 
"effective" samples per interval h n ' \jMj (ap max ) . For large effective sample 
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sizes, the first term in Eq. (33) dominates and gives J jj J2i=i( u (^i) ~ u h{ti)) 2 ~ 



0(h r ). For small sample sizes, appropriate in the infinite resolution limit, 
the second term dominates and g 0(h r ~ n ) - both of which 

are of the same order as an optimal polynomial approximation to u. 

From Eq. (36) it is also easy to find a convolution kernel estimate for u. 
According to Eq. 12, this estimate is 



Which explicitly states our solution as a convolution with the kernel G (x, y), 
so that our function estimate is a linear combination of B-spline basis func- 
tions. Setting p = 1 in the large sample size limit, when (ft is approximately 
constant over a large range of knots, G is symmetric and its Fourier trans- 
form [from Eq. (28)] approaches [31] 



Convolution kernels for other several derivative orders, n, are plotted in 
Figure 1, while for general p and (j), Abramovich and Grinshtein [1] have 
provided a method to systematically derive such asymptotic approximations. 

These asymptotic approximations give a sense of how the algorithm changes 
in response to differing penalty orders, n. From the discussion following (29), 
as n increases, higher order derivatives of the function are changing at the 
sample points, leading to a larger width convolution kernel. Note that the 
asymptotic approximation plotted in Fig. 1 breaks down when M < p. How- 
ever resolution independence is still maintained; since in intervals without 
design points the solution will approximate the L-spline solution to (29). 
This provides a second interpretation to n as specifying the order of the 
"default" polynomial solution. 

3.2. Marginal distribution of the variance. As in the previous section, 
the change of variables from A to za uses the insight that the spline pa- 
rameter estimate, 9, depends only on the ratio a. This change makes the 
posterior PDF of section 2.3 into 



(35) 



(34) 




(36) 




P (6za\DI) oc a 



(p-n)/2-lJM+p-n)/2-l 



(37) 
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Fig 1. Convolution kernels, G(x — y), plotted for a/VM = {1,1/3,1/9} (solid black, 
dashed, thick grey, respectively), and cj> — 1. Higher sample numbers cause the kernel to 
steepen, placing more emphasis on x — 0, while higher a values do the opposite. 



The marginal probability density za\DI is found by integrating over the p- 
dimensional 9, assuming there are enough input measurements to determine 
the n improper dimensions of Q corresponding to an n— 1 degree polynomial 
fit. 

P (za\DI) = J P (6za\DI) d6 

a (p-n)/2-l z (M-ny2-l e -%(e}+V +a(e 2 Q +E )) 



(38) oc 



|D T D + aQ 



1/2 



e}=\\T>-0-Y\\ 2 ,e 2 Q = e T -Q 



This shows that the variance is Gamma distributed conditional on a. 
(39) P (z\aDI) ~ rf" 



M-n e 2 f + V + a(e 2 Q + E ) , 



Interestingly, this estimate for the variance is at odds with most procedures 
suggested in the literature [35]. Taking the inverse of the average, we get 
E [z\a)~ l = [e 2 f + Vo + a(eQ + Eq)]/{M — n), which resembles the square 
error estimate with M — n samples but includes the string energy, €q. An 
explanation for this could be that specifying a gives information about the 
error scale, z relative to average string deviation where specifying A alone 
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would not. This reveals some of the subtle distinctions which are often missed 
between A and a, especially in discussions of variance estimates. It also 
validates the interpretation of Eq and Vq as residual uncertainties and proves 
that the variance estimate is insensitive to them as long as they remain less 
than Cq and ej. 



3.3. Marginal distribution of the Penalty Parameter. Since z\aDI has 
a Gamma distribution, the normalizing constant for z is known and it can 
be integrated out of (38) to give 

Cp-n)/2-l e 2 + y 3 + a(£ 2 +E ^ 



(40) P (a\DI) oc 



|D T D + aQ 



1/2 



The first term and the determinant in the denominator can be thought 
of as offsetting terms, pushing smoothness higher until the point where the 
p — n eigenvalues of Q dominate the determinant. 

The second term is the most interesting if we remember that a repre- 
sents the ratio between the spline tension and spring constants. The above 
distribution says that we can learn about the over /under-smoothing trade- 
off through comparing the relative magnitudes of the function deviations, 
e^(a) + Vo and the roughness, eg (a) + Eq. As a increases, the function de- 
viations increase and the roughness decreases, so that the two energy scales 
can reach a balance. The appearance of Eq and Vq is natural in this context, 
since it prevents choosing either extreme. 

The (twice negative log of the) posterior likelihood (40) can also be com- 
pared to the AIC and GCV functions for choosing a which have been ex- 
tensively analyzed in the literature. Defining H(a) = D • (D T • D + aQ)D r , 
these two functions are: 

(41) GCV(a) = Me} (M - Tr(H)) -2 

(42) AIC{a\z) = ze) + 2Tr (H) . 

Where we have set z = 23.74~ 2 , following Eilers and Marx [8] and ignored 
constant terms. 

First, we note that neither of these two functions has a straightforward 
generalization for inference on vector-valued functions, so we must consider 
alternatives here. This is because there is not a one-to-one correspondence 
between individual functions to be added together and the dimensions of 
y(r), rather all functions have the ability to contribute to all output dimen- 
sions. 
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Fig 2. Comparison of alternate criteria for estimating a. In the left panel, GCV and AIC 
are as in Eq.s (41) and (42), while marginal denotes — 2 log P(a\DI), and all functions 
have been translated to set the minimum to zero. The right panel shows the posterior- 
average and uncertainty using the present method and GCV. 

Examining the criteria specified in this report, the GCV has very good 
properties for the one-dimensional case. Although the interpretation of a 
as specifying the relative scale between A and z is not clear with the GCV 
estimate, its minimum remains invariant to scaling both the function and its 
error (scale independence) since and H are functions of a only. However, 
alternatives such as the AIC or Mallow's C p criterion which require an a 
priori estimate for the variance do not necessarily have this property. 

Figure 2 compares the different functions of a for Hardle's motorcycle 
helmet data [8]. It can be seen that the GCV and AIC predict values in 
roughly the same neighborhood, while the present log-likelihood has a min- 
imum at a much lower a. Despite this, the present method differs very little 
from the GCV estimate in this example, while the AIC estimate (not shown) 
is indistinguishable from the GCV curves. We can therefore conclude that 
averaging over a range of a values does not spoil the fit to this data set, but 
allows additional smoothing (as can be seen from Fig. 1) as well as a more 
cautious error estimate. 

4. Example Problems. We consider some example cases which serve 
to test the scale independence of our method as well as illustrate the sig- 
nificance and usefulness of multidimensional function matching. First, the 
proposed prior is compared to other suggestions in the literature for a few 
simple test functions. Next, we progress to a high-dimensional example which 
attempts to describe the dynamics of 256 atoms. This system and others like 
it are numerically tractable since their behavior can be explained in terms 
of only a few additive functions, while the main difficulty of fitting these 
functions lies in separation of the measured sums. 

imsart-aos ver. 2009/08/13 file: stsplines.tex date: March 26, 2010 



RES. AND SCALE INDEP. FN. MATCHING 



19 



4.1. Method Comparison. We investigated the properties of various P- 
spline formulations using a standard set of model functions proposed by 
Lang and Brezger [18]. As in that report, p is chosen to be constant, and 
the derivative order is set to n = 2. The most important test functions 
for our current purposes are the linear function f\{r) = r/1.758 and the 
sinusoidal function (here modified from the original for exact periodicity) 
Hi?) = sin("7ra;/3)/0.72. The linear function tests the algorithm's stability 
in the degenerate polynomial solution case, and the sinusoidal function tests 
the algorithm's over/undersmoothing trade-off. The sinusoid function was 
modified because for the small number of samples used here almost all fits 
were noticeably linear when periodic B-splines were not used. These func- 
tions are defined over the range [—3, 3] and fit to 20 4 th order (cubic) B-spline 
knots using 20 equidistantly spaced noisy samples in order to force the solver 
into the low-sample regime. 

In order to compare across different versions of the algorithm and different 
scales within each version, the same set of independent random noise values 
generated from the standard normal distribution were appropriately scaled 
and used in all tests. All mean squared error results in this subsection have 
been renormalized via division by the square of the initial scaling applied. 
This makes for easier comparison, but requires us to keep in mind that larger 
error scales still produce worse fits in an absolute sense. All systems used 
MCMC Gibbs sampling of X,z,6 as described in section 2.3 with a burn-in 
of 2500 steps, followed by 25000 sampling steps, collecting one sample every 
5 steps for a total of 5000. Correlation times for A were always higher than 
those for z, and are included in our figures (right y-axis). Correlation times 
for 6 are not as important, since the conditional average (E (0\X, z)) has been 
collected with each sample. 

Figures 3 and 4 show boxplots of the (normalized) empirical root mean 
squared deviation (RMSE) of the spline from its respective function at the 
20 input sample points - log scale, left axis. Each figure shows the RMSE for 
the input samples (far left) as well as results from four different prior distri- 
butions for A. They are (from left to right): X: (A|I) ~ T(0, 10" 10 /2), (z\I) ~ 
r(0, 10- 10 /2); Y: (X\5I) ~ T(l, 5), (5\I) ~ r(10" 4 , 10" 4 ); Z 2 : (A| I) ~ T(l, 10~ 2 ), 
and Zq: (X\I) ~ T(l, 10 -6 ). For each function and prior distribution, several 
different noise scales were applied to the random deviations, with a indi- 
cated on the x-axis. In order to compare the relative error achieved with the 
efficiency of the method, the autocorrelation function of A was calculated 
and fit to an exponential for each test to estimate the correlation time. The 
average and standard deviation of this quantity over all 100 runs gives an 
indication of the number of MCMC steps required to draw an independent 
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Fig 3. Boxplots of error distributions for (left to right) variance of input samples, X: 
this report's method, Y: Lambert and Jullion's method, and Lang and Brezger's method 
using Z2: b = 10 -2 and Zq: b = 10 -6 . For each method, several noise scales are shown as 
explained in the text. Green lines (offset for visual clarity) display decorrelation time. The 
right panel shows the ratio between the sample RMSE and the estimated RMSE (left scale, 
calculated by averaging z^ 1 ) and the magnitude of posterior average penalty parameter. 
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Fig 4. Boxplots of error distributions, labeled as in Fig. 3. 
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sample. 

The linear case is an interesting smoothing limit to investigate, since the 
best approximation we can make is fitting to a polynomial of order n — 1, and 
the bias of the MSE derived in Eq. (17) is zero. A polynomial fit corresponds 
to A — > oo in Eq. (2), a case for which our numerical solver becomes unstable 
[see discussion in section 2.2], which indeed occurred during our test when 
we set Eq = 0. Our choice for Eq sets an effective maximum on A, which 
could lead to undersmoothing. However, the MSE and RMSE results of 
Fig. 3 from different methods are almost identical with the exception of Z<i , 
indicating that A was large enough. This figure shows that the A selected by 
each method is strongly influenced by their respective prior distribution. In 
particular, Z2 and Z§ show a clear preference for 10 -2 and 10 -6 , respectively. 
Methods X and Y seem to show more response to variations in the error 
scale, decreasing A as er 2 increases (implying a relatively stable choice for 
a). 

Fitting to the sinusoid function shows the under /oversmoothing tradeoff 
of each method. Figure 4 shows that all methods (with the exception of Z2 
and Zq) perform equally well predicting both the function and error scale. 
Methods Z2 and Zq break down at low signal to noise, where the strength 
of their prior distribution for A begins to outweigh the data. The similar 
performance of X and Y in this test indicates that they are less sensitive to 
their prior for A. This could have been predicted from the fitting results, since 
(for X) the function error, e 2 , deviates by more than Eq from a polynomial 
solution (see Sec. 3.3), and (for Y) A does not have too small a value. 

To see why the fitting results of Y may display sensitivity to the prior 
parameters, we can integrate the compound prior j Py (X5\ I) d5 to give 
P y (A|I) = (6/(A + b)) a with a = b = 1(T 4 as used here. This function 
deviates significantly from the Jeffrey's prior A -1 at small A, indicating a 
preference for larger A. However, it has a smoother decay toward zero at 
A — > 00 than X. This smoother decay at large A could also be achieved by 
adding a similar hyperprior for Eq in our method. 

The function fa was selected for a further test of scale-dependence. This 
was done by varying the overall scale of the function and its noise on a 
logarithmic scale from 10 -9 to 10 +9 . Each data set used for fitting contained 
50 equally spaced samples. Although a scale-dependence of the spline fit was 
noted by Lang and Brezger with the suggested remedy of standardizing the 
input samples, this strategy is difficult to justify if adaptive penalties are 
considered, and impossible for multi-dimensional measurements, where no 
linearly independent set of directions to standardize is guaranteed. None 
of the tests reported here make use of such standardization of input values, 
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Fig 5. Boxplots of error distributions showing variation with respect to total problem scale 
(W~ 9 to W 9 ), other symbols as in Fig. 3 



since an ideal method should produce identical (scaled) results for any choice 
of overall scale. 

The results of this test are shown in Fig. 5. As expected, Z<i and Z§ show a 
bias toward their respective prior A values, giving good MSE results for scales 
of 10" 1 or 10~ 3 . However, they show under/oversmoothing when the scale 
is below/above that value, respectively. Undersmoothing has a characteris- 
tically higher MSE and a larger range of a estimates, while oversmoothing 
shows a wider range of MSEs and outliers where a is overestimated along 
with a wide range of A correlation times. 

For X, the method breaks down at scales less than 10~ 5 . This happens 
because the scale of the input data is below the order of \/Eo, violating our 
assumption in setting Eq. To understand this in the context of our prior 
from Sec. 2.2, even as the data scale diminishes (reducing eg), A gets stuck 
at Eq and can't go higher. This can be seen directly in Fig. 5. The same 
process happens for z, which gets stuck at Vq as the scale diminishes. This 
has caused log 10 <5" 2 to bottom out at —11.5 at a scale of 10~ 6 and —11.6 
at a scale of 10 -9 [off the scale of Fig. 5]. Based on these observations, we 
can ask if €q/Eq or e^/Vo is too small (say less than 10) as an indicator of 
whether our method is predicting an exact polynomial fit or an exact fit of 
the input data. At large scales, these ratios are large, indicating that Eq and 
Vo have become irrelevant and scale-independence is achieved. As explained 
above for method Y, A must decrease at large scales, but the prior for A is 
too small close to zero, causing oversmoothing. 

4.2. Multidimensional Problems. Modeling of dynamic processes is usu- 
ally carried out via numerical integration in time (r) of, e.g. Newton's equa- 
tion of motion for a given potential function, E(r). 

(43) m^r = = Mr) 
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Where the system position is specified by r € R (N being three times the 
number of atoms, r being indexed by 3i + a). Standard molecular dynamics 
potential functions consist of multiple additive functions and are of the form 

N f 

(44) E (r) = X>*(*fc(0)- 

fe=i 

where each ifc(r) is a scalar function of r, e.g. a distance between two atoms. 
Attempting to match samples of the force {y, r} requires differentiating the 
above with respect to r to give the linear mixed model 

(45) f{r) = ^2-^^-E' k (t k (r)) + noise. 

k 

Here —dtk(r)/dr is an A r -dimensional vector with components —dtk(r)/dr^i +l 
giving the direction of each contribution to the total force, identified with 
gk(r) in section 2.4, and each E' k {tk{r)) = B/c • 8k is modeled via a B-spline 
basis. The most striking property of this model is that it holds to the simple 
linear mixed model form and is amendable to the same sorts of analyses 
carried out there. 

For a nontrivial example of this type of problem, we consider a system of 
256 Lennard-Jones particles with unit mass interacting via the pair potential 

(46) E(r)= Yl Ac ^ (\^ ~ r j\~ 12 ~ \n ~ r i\~ 6 ) ■ 

l<i<j<256 

Using the notation | r j — r,- 1 to mean the Euclidean distance between r, £ R 3 
and the closest periodic image of rj (since all the atoms have been placed 
into a cubic cell of length 7.49). By setting = 1/2 whenever 1 < % < 
128 and 129 < j < 256, or Cij — 1 otherwise, we have set up a two- 
component (binary) mixture. Due to their relatively low cross-attraction, 
particles of type A (1 < i < 128) and those of type B (129 < i < 256) have 
been found to be immiscible under the types of conditions used here [20], 
separating into two liquid layers within the simulation cell. The liquid state 
may be metastable with respect to a solid crystal phase, but such crystal- 
lization was not observed in any of the simulations reported here. 

Numerical integration of Eq. (43) with a time-step of 1.461 x 10~ 3 was 
used to simulate this system until steady-state behavior was observed. Five 
hundred configurations, {r}^ 00 , were obtained by sampling every 500 steps 
in a canonical ensemble simulation, which uses a slight (stochastic) modifi- 
cation of the equations of motion [33] to guarantee that the velocities dri /dr 
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are independently normally distributed about zero with variance f3~ , here 
chosen to be 0.7917. Although the units stated in this report make use of a 
reduced units convention, the problem remains unchanged if we set physical 
units of e = 1.26 kJ/mol, a = 3.7 A, T = 120 K, m = 16 g/mol, At = 1.935 
fs, and 7 = 5.1677 ps _1 . 

For consistency with Eq. (45), the set of forces was generated from the 
set of configurations using the following procedure. First, the three pairwise 
potential functions 4qj (t -12 — i~ 6 ) were replaced with their spline represen- 
tations on t £ (4/7, 17/7) and used to find the mean force corresponding to 
each configuration. Next, independent normally distributed random noise 
with magnitude of = 60.91 was added to each dimension of each gener- 
ated force sample. This procedure generates the same sample distribution 
as would be expected from a Langevin dynamics simulation with a moderate 
damping coefficient of jAt = 10~ 2 . 

Since we have three functions to match (A: A, A:B, and B:B), we use 
Eq. (20) with Nf = 3 to compute D; & E ~MNxp k for each frame and calculate 
the posterior mean and covariance using Eq. (22). In this equation, the sets 
K are assumed to include all terms of Eq. (46) which share a common set of 
parameters Ok (i-e. interactions between atoms of type A:A, A:B or B:B). 
Similarly, we will assume each particle type has its own Zi, so we use Eq. (23) 
with Ni = 2 and each set I is simply the set of all coordinates belonging to 
atoms a common type. 

The pairwise functions were fit to 6 th order B-splines with 170 knots 
and h = 0.1/7 to give a range of (0, 17/7), forcing the function and all its 
derivatives to zero at 17/7. MCMC sampling was carried out as described 
for the one-dimensional test cases except for the use of the density function 
p(r) = r 2 , which is more appropriate for the spherically symmetric function 
domains considered here. Figure 6 shows the fitting results and a comparison 
between the posterior average and maximum likelihood estimators for a 
variety of sample sizes. Corresponding distributions of the observed pairwise 
distances for the M = 50 case are shown in Fig. 7 as a function of sample 
size (left scale). For this case the present approach is contrasted with the 
generalized least squares solution on the right scale. 

Since the actual system does not allow observations of all pairwise dis- 
tances (particularly for small r), the average mean-squared error between 
the functions and their spline fits were calculated by averaging over the ob- 
served distances from all 500 samples. The range of observed distances is 
also the appropriate one for considering the approximation error estimates 
of section 3.1, which serve as upper bounds on any subset of Q where p > 0. 

An unusually high error is observed for the function describing interac- 
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Fig 6. Effect of sample size on average error of the fitted functions. 
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tions between particles of type A and B. Inspection of the spline estimates 
reveals that this error is due to oversmoothing (the MLE estimate was es- 
sentially zero) caused by the relatively small number of samples for this 
function, and so does not occur when the prior is set to zero - even for the 
M = 50 case (GLS, Fig. 7). This failure of the MLE makes it a worse choice 
than GLS, and could not have been predicted from calculation of the poste- 
rior function error. The expected error conditional on the MLE estimate for 
a, z is (in units of Fig. 6) —5.0 for all functions at M = 50. At M = 100, the 
A-A and B-B error estimates jump to around —1.9 and remain constant as 
M increases, while the A-B error estimate remains at —5.0 until M = 350, 
where it jumps to —2.2 and slowly increases to —2.1 at M = 500. These 
numbers significantly underestimate the fitting error at small sample sizes 
due to their neglect of variations in a, z. 

On the other hand, the posterior average estimate behaves as expected, 
approximating the input function with accuracy increasing with sample size. 
The expected function error using this method is slightly over-estimated due 
to the uncertainty in a, z — smoothly decreasing from —1.5, —1.6 for A-A,A- 
B at M = 50 to -1.9, -1.8 at M = 500. Examining the M = 250 case, the 
likelihood ratio of the posterior average estimate 9 to the MLE is 10 -585 , but 
the average estimate performs better, and must be used, because of the width 
of the posterior distribution. Considering 9 as a 510-dimensional vector, if 
the probability distribution for a\9 is relatively flat over a range R away from 
9, then there are R 510 "states," 6, a, z, for which 9 = 9\a and which therefore 
have a very low, but similar posterior probability. Integrating over a region in 
state-space is essential in justifying the astronomical difference in pointwise 
probabilities. Finally, as the sample size increases, the posterior probability 
narrows, causing both estimators to converge. The dramatic failure of the 
MLE shown in Fig. 6 demonstrates the importance of averaging over the 
posterior parameter distribution for small sample sizes. 

5. Discussion. This paper presents a re-derivation of the commonly 
employed Bayesian statistical model for penalized spline matching. Several 
questions are addressed relating to the applicability of the method for gen- 
eral problems. Its main limitations stem from the possibility of choosing an 
inadequate form for the function to be fitted, y(r). This can happen either 
by leaving out the dependence of y(r) on some important function of r - 
causing larger than necessary noise in the data set - or by choosing a reso- 
lution that is too low (too large bin width) in order to save computational 
time. This latter problem can result in an oversmoothed function and will 
again increase the fitted value of a 2 . 
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Physical concepts are important to justify use of the quadratic penalty 
parameter and give a useful interpretation to the fitting process. Using the 
energy function for states of a stretched string, we can apply the Boltz- 
mann distribution to derive a simple prior probability on function space. 
This method can be used to justify further adaptations of Eq. (3) and could 
prove to be generally useful in other problems of inference. Also by analogy 
to the physical system, we can anticipate a unique solution to the optimiza- 
tion problem Eq. (26). This proof is a classical result of the smoothing spline 
literature, and has been invoked here to show that our B-spline represen- 
tation also optimally approximates the unique minimum, converging as the 
resolution is increased. 

By explicitly considering the variation of the smoothing parameter with 
respect to function scale, we are able to clearly present and compare sev- 
eral choices for the smoothing parameter proposed in the literature. These 
classical choices are difficult to apply in the multidimensional context, and 
several have the additional possibility of scale dependence. Although the 
GCV criterion has the desired scale independence property, it also has the 
problem of predicting an exact match to the input data at low sample sizes 
[36]. This problem (and the related one of predicting an exact polynomial 
fit) was solved in this report by placing constraints on the ability of the 
smoothing parameter estimate to force such an exact fit. The result is an es- 
timator which is asymptotically scale independent for input data satisfying 
these minimum variance and polynomial deviation properties. 

Special consideration has also been given to formulation of vector-valued 
function fitting problems. An important observation is that the dimension- 
ality of these problems can make maximum likelihood estimation ineffective 
for small sample sizes. There have certainly been previous treatments of 
multidimensional fitting and integrator matching, both in considering vec- 
tor generalized additive models [38] and in adapting generalized linear least 
squares to use Bayes' theorem in physical systems [14, 19]. However to the 
author's knowledge, this is the first report of the extension of P-splines to fit 
the drift and diffusion terms of molecular Langevin dynamics simulations. 

The approach developed in this paper can be directly used to fit stochastic 
integration processes such as the Langevin systems considered here, giving 
a well-defined way to extrapolate forward in time. This would work par- 
ticularly well since the inference result gives the average and variance of 
each step in the integration process. However, although normally distributed 
force error is usually assumed in applications of the Langevin equation (and 
Euler integration of stochastic equations in general), the question of corre- 
spondence between dynamics where this assumption does not hold requires 
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separate consideration. 

Future research could also re-consider the creation of an "overall" scale- 
invariant prior for smooth z(r) where random noise is an issue in this process. 
This would be particularly important for variable volatility in market data 
or inhomogeneous molecular systems. 
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APPENDIX A: INTEGRATED SPLINE DERIVATIVES 

We present a method to calculate the integral (6) using B-spline basis 
functions [7] for general p and n. Klaus Hollig has presented a simpler 
method for this calculation when p is constant in Ref. [13]. For general 
p we proceed by expressing the r th order B-spline B(x; 9) as a set of piece- 
wise polynomials and calculate the contribution to the integral from each 
interval separately. 

To begin, we re-scale the spline value x to u = x ~^° + A which maps 
the spline range x G (xo, xq + L) to the interval u G (A, L/h + A). Here, A 
can be either r/2 for a periodic spline or r — 1 for aperiodic splines, where 
setting L/h past p — A forces successive derivatives of the spline to go to 
zero at L. Using this mapping, spline knots are placed at [0, 1, ... ,p — 1] to 
give the representation in terms of the B-spline basis function M r (u). 



(47) B(u;< 



Yh=o M r (u-i)0i x G (xq, xq + L), 
o.w. 



For periodic splines, x is circularly mapped into the correct range, and ar- 
guments of B and M as well as indices to 9 are understood as modulo p. 

Next, we note that M r (u) is nonzero only in the range u G (0, r), so 
the only necessary values of i in the above sum are i G (u — r, u) Pi Z = 
{ [u\ — r + 1, [nj — r + 2, . . . , \ u\ }. Defining d = u — \ u\ , and using the fact 
that this evaluation only requires parameters {0}W_. r+1 = ^*L«J' we nave 
(for any u G (A, L/h + A)) 

L«J r-l 
B(u; 9)= Yl M r( u ~ = M ^ + 1 " d ) d i+iu\-r+i 

j=L«J-r+l »=0 

(48) = P r (d; Mf ■ 9\ ul = $ ■ M • M . 

Defining the above sum as a dot product with (r — l) th order polynomial 
functions P r (d; M) T = [P r (d; M°), . . . , P r (d; AT -1 )] (M* denotes a column 
of the r x r matrix M) emphasizes the linear nature of the spline basis 
functions. To further compress the notation, define a vector of powers 
[d°, d 1 , . . . , cf - 1 ] so that P r {d; c) = dj ■ c and P r {d; M) T = d£-M. 

We can solve for the coefficients M 3 in terms of the standard polynomial 
coefficients for M r (u) = P r (u; ) by using the Binomial theorem to expand 
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M r (i + 1 - d). 

r— 1 r—1 3 / % 

(49) P r (i + ad; c) = ^ c,- (i + ad) j = ^ ( 3 J ^- fc a fc d fe 



j=0 j=0 k=0 

r-l 

hd k = dj -b 

k=0 

■ b k = [B(a, i) ■ c] k 



The last expression defines a lower-triangular matrix of binomial coefficients, 
(50) [B(M)] ifc 



otherwise. 



Comparing the terms in Eqs. 48 and 49, the columns of M are given by: 

(51) M j =B(-l,j+l) -d* 

Where c? are the polynomial coefficients used for normal B-spline interpo- 
lation on an interval j. A further simplification of (51) is possible by noting 
that P r (u; c Lu J ) = M r (u) = M r (r - u) = P r (r - u; cL r " u J ) so that any of the 
columns in (51) can be equivalently expressed as: 

(52) M j = B(+l, r - 1 - j) ■ c r ~ l - j 

specifically, replacing the last half of the columns (j = \r/2] , • • • , r— 1) only 

[r/2] 

requires the initial computation of [r/2] coefficient vectors, {c} — 1. 
We can also note because of linearity that 

(53) BW(u; 9) = 9f uj M T T> n d r . n . 

Which represents differentation using an r x (r — n) matrix D n with nonzero 
entries [D re ]y = — n)\ only on the diagonal i — j = n. 
Finally, we can assemble the integral 

(54) E = y- 1 Kfc 1+fc - 2n I 1 <7 (n) (n; 9) 2 {u + A + x /h) k du 

J u 

where we have restricted ourselves to the case where p(x)T(x) /Tq = Kx k for 
illustration and transformed x to u and (xo, xo + L) to (no, u\) as above. Now 
we decompose the integral into separate contributions from each interval in 
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{/} = {(tto, L^oJ + 1), (L«oJ + 1, L^oJ + 2), • • • , (L n iJ >«i)} to S ive a sum of 
integrals from each interval identified by its unique . 

! 1 g^\u;6f(u + /\ + x Q /h) k du 

= Y [ g {n) {u;6) 2 {u + A + x /h) k du 

= V / 9f ■ M T ■ D n • d r - n dj_ n T>1 M 0/(ti + A + x /h) k du 

(55) =2^/-Q/-^ 

{1} 

Assuming we have a general form for the integral of p times a power of x on 
each interval, we write 

(56) Qi = M T • D n • F • • M, 

with [F]ij = f I d i+ i(d+ [u\ + A + x /h) k dd, i,j = 0,l,...,r-n-l. 

The complete penalty matrix Q is then assembled according to (55) 
by shifting each of the matrices (56) into its proper location (multiplying 

Oj = {0}u/j_ r +i)- F° r periodic splines, p must also be periodic and the 
edges of this matrix are wrapped to correspond to the correct parameters. 
For aperiodic splines, edges are simply discarded, since their corresponding 
parameters have been assumed to be zero. 
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AVAILABILITY 

A python implementation of the methods used in this report have been 
made available as a part of the ForceSolve project on sourceforge.net 
(http://forcesolve.sourceforge.net). This software was created as a proof-of- 
concept for a general method to fit molecular dynamics data to arbitrary 
stochastic integration models. 
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